# `xwalks` Descriptions
This directory contains the crosswalk files used to map raw medical codes to the feature variables in our analyses. The code for this process is quite particular to our technical environment and so is not included in this repo, but the mappings given here should be sufficient to recreate the feature set in a different environment.

Our modeling features fall into several broad categories: **demographics**, **ED encounters**, **other encounters**, **diagnoses**, **labs**, **vital stats**, **medications**, and **procedures**. For each type of input data, we establish sensible groupings for raw codes and consider these groupings in four time windows: 0 to 30 days prior to the ED visit, 30 days to 1 year prior to the visit, 1 year to 2 years prior to the visit, and any time in the 2 years prior to the visit. For non-numeric feature types (e.g., diagnoses), we create binary flags and counts (0, 1, 2, ...) for each category in each time window. For numeric feature types (e.g., labs), we generate summary statistics within each time window (e.g., min, max, and mean).

The particulars of each crosswalk file are detailed below. For more information on the structure of the features data, please contact Cassidy Shubatt (<cshubatt@gmail.com>).

- `ccs_multi_dia_2015.csv`: mapping ICD-9 **diagnosis** codes to multi-level (more specific) [AHRQ CCS categories](https://www.hcup-us.ahrq.gov/toolssoftware/ccs/ccs.jsp)
- `ccs_multi_prc.csv`: mapping ICD-9 **procedure** codes to multi-level (more specific) AHRQ CCS categories
- `ccs_single_dia_2015.csv`: mapping ICD-9 **diagnosis** codes to single (more general) AHRQ CCS categories
- `ccs_single_prc.csv`: mapping ICD-9 **procedure** codes to single (more general) AHRQ CCS categories
- `ed_cc_classification.csv`: mapping **ED encounter triage** notes to chief complaint categories
- `lab_cat_to_loinc.csv`: mapping LOINCs to **lab** categories
- `lab_units.csv`: units of measurement for **lab** categories
- `loinc_table.csv`: additional information associated with LOINCs
- `README.md`: you're reading it!
- `rxnorm_med_name.csv`: mapping **medication** names and RXCUI codes to ATC level names
- `stress_test_codes.yml`: ICD-9 diagnosis and procedure codes used to construct key outcomes (stress test, catheterization, MACE, stent, and CABG)
- `zocat_xwalk_03_2021.csv`: mapping ICD-9 **diagnosis** codes to "zocats", a modification of the AHRQ CCS system, that additionally breaks out subcategories for clinical entities (e.g., pulmonary embolus) and symptoms (e.g., pain)
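
As a rough illustration of how one of these crosswalks might be applied, the sketch below reads a two-column mapping and turns a patient's raw codes into per-category counts and flags. It is not the pipeline used for the analyses: the column names (`icd9_code`, `category`), the category labels, the code normalization, and the helper names are all hypothetical, so check each file's actual header before adapting it.

```python
import csv
import io
from collections import Counter

# Stand-in for one crosswalk file. The column names and category labels are
# hypothetical placeholders, not copied from the files in this directory.
SAMPLE_XWALK = """icd9_code,category
41401,category_a
78650,category_b
"""


def normalize(code):
    """Strip whitespace and dots and uppercase, so "786.50" matches "78650".

    Whether the real files need this normalization is an assumption.
    """
    return code.strip().replace(".", "").upper()


def load_crosswalk(handle, code_col, category_col):
    """Read a crosswalk CSV into a dict keyed by normalized raw code."""
    return {
        normalize(row[code_col]): row[category_col]
        for row in csv.DictReader(handle)
    }


def category_counts(raw_codes, xwalk):
    """Count raw codes per category; codes missing from the crosswalk are skipped."""
    return Counter(
        xwalk[normalize(code)] for code in raw_codes if normalize(code) in xwalk
    )


xwalk = load_crosswalk(io.StringIO(SAMPLE_XWALK), "icd9_code", "category")
counts = category_counts(["786.50", "78650", "41401", "V70.0"], xwalk)
flags = {category: 1 for category in counts}  # binary flag per observed category
print(dict(counts))
print(flags)
```

To read a real file instead of the in-memory sample, pass `open(path, newline="")` as `handle` along with that file's own column names.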

## Categories with Multiple Crosswalks
For diagnoses and procedures, there are multiple classification systems that would make sense to use to organize our features. Rather than arbitrarily choose one specific system, we create features for a variety of systems and leave it to the machine learning algorithms to work out which classifications are most useful for prediction. For example, many `zocat` variables are extremely similar (or even identical) to CCS single- or multi-level variables, but this is fine! Unlike traditional inference methods such as OLS, LASSO and GBM are robust to highly collinear inputs when the goal is prediction.

## Categories without Crosswalks
For several of the feature categories mentioned above, there is no crosswalk file. This is because these features are not, in general, constructed from some set of universal raw codes. Each of these is detailed a little more below.
### Demographics
Unlike the features above, demographics are measured at the time of the visit and do not reflect patient history. They include the following variables:
- Race (Categorical)
    - Black
    - Hispanic
    - Other
    - White
- Sex
- Age at Admit
- Adjusted Gross Income in Zipcode (Percent in Each Range)
    - $0-24,999
    - $25,000-49,999
    - $50,000-74,999
    - $75,000-99,999
    - $100,000-199,999
    - $200,000 or more
- Distance From Hospital
    - Miles
    - Log Miles
### Other Encounters
We create flags for past encounters with the hospital in the time windows above. We break encounters out by clinic ID (e.g., did the patient visit a cardiologist or a podiatrist) and length of stay.
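
The sketch below shows one way such encounter flags could be built for the four time windows listed in the introduction. It makes several assumptions that are not documented here: a year is treated as 365 days, each window includes its lower bound and excludes its upper bound, and the length-of-stay cut points in `los_bucket` are invented purely for illustration. The function and feature names are hypothetical as well.

```python
from datetime import date

# Days before the ED visit, as start <= days_before < end.
# Boundary handling and the 365-day year are assumptions.
WINDOWS = {
    "0_30d": (0, 30),
    "30d_1y": (30, 365),
    "1y_2y": (365, 730),
    "0_2y": (0, 730),
}


def los_bucket(days):
    """Bin length of stay; these cut points are hypothetical."""
    if days == 0:
        return "same_day"
    return "short_stay" if days <= 2 else "long_stay"


def encounter_flags(visit_date, encounters):
    """Build binary flags per (clinic, window) and (length-of-stay bucket, window).

    encounters is an iterable of (encounter_date, clinic_id, length_of_stay_days).
    A key that is absent from the result means the flag is 0.
    """
    flags = {}
    for enc_date, clinic_id, los_days in encounters:
        days_before = (visit_date - enc_date).days
        for window, (start, end) in WINDOWS.items():
            if start <= days_before < end:
                flags[f"clinic_{clinic_id}_{window}_flag"] = 1
                flags[f"los_{los_bucket(los_days)}_{window}_flag"] = 1
    return flags


visit = date(2015, 6, 1)
history = [
    (date(2015, 5, 20), "cardiology", 0),  # 12 days before the visit
    (date(2014, 3, 1), "podiatry", 4),     # 457 days before the visit
]
flags = encounter_flags(visit, history)
for name in sorted(flags):
    print(name, flags[name])
```

Counts for non-numeric features such as diagnoses follow the same pattern, incrementing a counter instead of setting the value to 1.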
### Vital Stats
Like labs, vital statistics are numeric variables and so are summarized with statistics rather than counts. Vital stats contain the following variables:
- Heart rate
- Respiratory rate
- Height
- Weight
- Temperature
- Blood pressure (systolic)
- Blood pressure (diastolic)
- Mean arterial pressure (MAP)
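
For numeric features like these, the per-window summaries could be computed along the following lines. This is a minimal sketch rather than the original code: the window boundaries (365-day year, lower bound inclusive), the function name `summarize_numeric`, and the feature naming scheme are assumptions.

```python
from datetime import date
from statistics import mean

# Days before the ED visit, as start <= days_before < end.
# Boundary handling and the 365-day year are assumptions.
WINDOWS = {
    "0_30d": (0, 30),
    "30d_1y": (30, 365),
    "1y_2y": (365, 730),
    "0_2y": (0, 730),
}


def summarize_numeric(visit_date, measurements):
    """Return min, max, and mean per (measure, window).

    measurements is an iterable of (measure_date, name, value).
    Windows with no measurements produce no keys.
    """
    grouped = {}
    for measure_date, name, value in measurements:
        days_before = (visit_date - measure_date).days
        for window, (start, end) in WINDOWS.items():
            if start <= days_before < end:
                grouped.setdefault((name, window), []).append(value)

    features = {}
    for (name, window), values in grouped.items():
        features[f"{name}_{window}_min"] = min(values)
        features[f"{name}_{window}_max"] = max(values)
        features[f"{name}_{window}_mean"] = mean(values)
    return features


visit = date(2015, 6, 1)
vitals = [
    (date(2015, 5, 25), "heart_rate", 72),  # 7 days before the visit
    (date(2015, 5, 10), "heart_rate", 88),  # 22 days before the visit
    (date(2014, 12, 1), "heart_rate", 64),  # 182 days before the visit
]
summary = summarize_numeric(visit, vitals)
print(summary["heart_rate_0_30d_mean"])
print(summary["heart_rate_0_2y_min"], summary["heart_rate_0_2y_max"])
```

The same function applies to lab values once they have been grouped into lab categories and converted to consistent units.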
